AI-generated content has surged with recent advances in generative technology. Deepfake audio clips now sound highly realistic, posing serious concerns for ethics, privacy, and security. Although many techniques have been developed to distinguish fake from real speech, challenges remain due to noise, compression, and speaker variation. In this study, we evaluate four models on the “In-the-Wild” audio deepfake dataset using a balanced subset of 20,000 samples standardized to 16 kHz and 2-second clips. Two models use hand-crafted features (Wavelet + ANN and FFT + ANN), while two apply deep transfer learning, using ResNet18 and a fully fine-tuned ResNet50 on log-Mel spectrograms. Experimental results show that the traditional ANN models achieve only moderate performance, with 58%–75% accuracy and higher false-positive rates. In contrast, the deep learning models generalize better: ResNet18 reaches 97% accuracy, and ResNet50 achieves the best performance at 98.9% accuracy with near-perfect F1-scores. These findings indicate that spectrogram-based representations combined with powerful pre-trained CNN architectures provide a more robust and reliable solution for real-world audio deepfake detection.
Introduction
The rapid advancement of generative AI has enabled the creation of realistic synthetic content such as text, images, videos, and audio, commonly known as Deepfake technology. While this technology has beneficial applications, it also raises serious ethical, privacy, and security concerns, particularly when AI-generated speech is used to impersonate individuals or spread misinformation. Deepfake audio can influence political decisions, damage reputations, and even create false digital evidence in legal contexts. As a result, there is an increasing need for reliable systems that can detect AI-generated speech in real time.
To address this issue, the study uses the In-the-Wild audio dataset, which contains both real and AI-generated speech samples. The research evaluates four models for deepfake speech detection. Two models use traditional signal-processing techniques (wavelet and Fourier transforms) for feature extraction, followed by classification with Artificial Neural Networks (ANNs). The other two apply deep learning architectures, ResNet18 combined with an ANN classifier and a fully fine-tuned ResNet50, to spectrogram image representations of the audio signals. Model performance is compared using accuracy, precision, recall, and F1-score.
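The four evaluation metrics named above follow directly from the binary confusion matrix. As a minimal sketch (the function name and label convention, 1 = fake, are illustrative, not taken from the paper):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = fake, 0 = real)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # fakes caught
    fp = np.sum((y_pred == 1) & (y_true == 0))  # real speech flagged as fake
    fn = np.sum((y_pred == 0) & (y_true == 1))  # fakes missed
    tn = np.sum((y_pred == 0) & (y_true == 0))  # real speech passed
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

In practice a library such as scikit-learn would compute these, but the explicit form makes clear what a "false claim" (false positive) costs each metric.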
The literature review highlights recent research on audio deepfake detection using both traditional machine learning and deep learning methods. Previous studies have used techniques such as Mel-Frequency Cepstral Coefficients (MFCC), CNNs, Random Forest, and transfer learning models like VGG19 and VGG16. Findings generally show that deep learning approaches outperform traditional models, though challenges remain in areas such as noise handling, dimensionality reduction, scalability, and generalization to unseen data.
The research methodology uses a balanced dataset of 20,000 audio samples (10,000 real and 10,000 fake), split into training and testing sets in an 80:20 ratio. Audio files are standardized by converting them to mono, fixing the sampling rate at 16 kHz, and ensuring a 2-second duration. Three main feature extraction pipelines are implemented: wavelet-based features using Discrete Wavelet Transform (DWT), frequency-domain features using Fast Fourier Transform (FFT), and log-Mel spectrogram image representations for deep learning models.
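The standardization and feature-extraction steps above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the function names are assumptions, the FFT bin count is an arbitrary example, and a single-level Haar transform stands in for whatever wavelet family the study actually used.

```python
import numpy as np

SAMPLE_RATE = 16_000          # fixed sampling rate (Hz)
N_SAMPLES = SAMPLE_RATE * 2   # 2-second clips -> 32,000 samples

def standardize(waveform: np.ndarray) -> np.ndarray:
    """Force a clip to mono, 16 kHz-length, exactly 2 s (pad or truncate)."""
    if waveform.ndim > 1:                          # multi-channel -> mono
        waveform = waveform.mean(axis=0)
    if len(waveform) < N_SAMPLES:                  # pad short clips with silence
        waveform = np.pad(waveform, (0, N_SAMPLES - len(waveform)))
    return waveform[:N_SAMPLES]                    # truncate long clips

def fft_features(waveform: np.ndarray, n_bins: int = 512) -> np.ndarray:
    """Frequency-domain features: magnitude spectrum, first n_bins bins."""
    spectrum = np.abs(np.fft.rfft(standardize(waveform)))
    return spectrum[:n_bins]

def haar_dwt(waveform: np.ndarray):
    """One DWT level (Haar): approximation and detail coefficients."""
    x = standardize(waveform)
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)
    detail = (even - odd) / np.sqrt(2.0)
    return approx, detail
```

The log-Mel spectrograms for the deep models would typically be produced with a dedicated audio library (e.g. librosa or torchaudio) rather than by hand.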
Four classification models are then applied: ANN-Wavelet, ANN-FFT, ResNet18 combined with ANN, and ResNet50 with full fine-tuning. These models analyze the extracted features to classify audio as real or deepfake using a binary classification approach.
Overall, the research aims to determine the most effective model for detecting deepfake audio by comparing different feature extraction techniques and machine learning architectures, contributing to the development of more robust and reliable deepfake detection systems.
Conclusion
This study compared four methods (wavelet-based ANN, FFT-based ANN, ResNet18, and ResNet50) for detecting audio deepfakes using the challenging “In-the-Wild” dataset. The results clearly show that the deep learning methods perform much better. The wavelet- and FFT-based ANN models had lower accuracy, especially in correctly detecting real speech. In contrast, the ResNet-based models achieved very high performance within just five training epochs. These results show that spectrogram-based representations combined with powerful pre-trained CNN models are far more effective for detecting real-world audio deepfakes than traditional hand-crafted features. Future work can focus on combining models, improving noise robustness, testing newer generative speech models, and developing real-time detection systems. Overall, deep transfer learning proves to be a strong and reliable solution for tackling audio deepfake threats.